Load the data! And summarize some of the variables.
## event_id fatality_count injury_count
## Min. : 1 Min. : 0.000 Min. : 0.000
## 1st Qu.: 2785 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 5563 Median : 0.000 Median : 0.000
## Mean : 5599 Mean : 3.219 Mean : 0.752
## 3rd Qu.: 8435 3rd Qu.: 1.000 3rd Qu.: 0.000
## Max. :11221 Max. :5000.000 Max. :374.000
## NA's :1385 NA's :5674
## admin_division_population gazeteer_distance longitude
## Min. : 0 Min. : 0.000 Min. :-179.98
## 1st Qu.: 1963 1st Qu.: 2.364 1st Qu.:-107.87
## Median : 7365 Median : 6.255 Median : 19.69
## Mean : 157760 Mean : 11.874 Mean : 2.52
## 3rd Qu.: 34021 3rd Qu.: 15.816 3rd Qu.: 93.95
## Max. :12691836 Max. :215.449 Max. : 179.99
## NA's :1562 NA's :1562
## latitude
## Min. :-46.77
## 1st Qu.: 13.92
## Median : 30.53
## Mean : 25.88
## 3rd Qu.: 40.87
## Max. : 72.63
##
From the summary above I see that fatality_count peaks at 5000, which is an outlier given the much lower mean of about 3.2. I'll also exclude zero values, since at least half of the records report no fatalities, which is good of course. And not to forget, I'll also remove all NA values from my new subset.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 3.00 12.72 7.00 5000.00
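The subsetting that produced the summary above can be sketched roughly as follows. This is a minimal sketch on made-up values: `landslide_data` and `fatality_subset` are the data frame names that appear in the outputs further down, but the toy rows here are assumptions for illustration only.

```r
# Toy stand-in for landslide_data (values are illustrative, not from the catalog)
landslide_data <- data.frame(
  event_id       = 1:6,
  fatality_count = c(0, 3, NA, 1, 0, 5000)
)

# Keep only rows with a recorded, non-zero fatality count
fatality_subset <- subset(landslide_data,
                          !is.na(fatality_count) & fatality_count > 0)

# Same kind of six-number summary as shown in the report
summary(fatality_subset$fatality_count)
```

Note that `subset()` already drops rows where the condition evaluates to NA, so the explicit `!is.na()` is there mainly to make the intent obvious.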
Now that makes more sense! Next I'll plot a histogram of the new subset to see how the data is spread out.
So, it seems that the fatalities are mostly quite low. Let's explore the injury count and whether it correlates with the fatality count. Here I'll remove all zero values and all NA values from my new subset.
Great now what about the administrative division population?
Interesting, so most events occur where the administrative division population is around 10000. I'll explore this further later on.
Now let's inspect when landslides occur! To do this I extract the year and month from the event_date variable.
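A sketch of deriving year and month columns from event_date in base R. The timestamp format used here ("YYYY-MM-DD HH:MM:SS") is an assumption, as are the example values; the real column may need a different `format` string.

```r
# Toy event_date values; the timestamp format is an assumption
event_date <- c("2010-08-08 12:00:00", "2013-06-16 05:30:00")

# Parse as UTC so the result does not depend on the local time zone
parsed <- as.POSIXct(event_date, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Pull out the year and the month as integers
year  <- as.integer(format(parsed, "%Y"))
month <- as.integer(format(parsed, "%m"))

data.frame(event_date, year, month)
```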
So there is a slight rise in the number of events in the middle of the year compared to other months. I expected most events to be recorded in 2016 or 2017, as the number of people reporting mass movements increases, but the number of reported mass movements seems to have flattened since 2013.
Now I want to investigate events where there are fatalities!
Now this is interesting! The summer months in the northern hemisphere (winter in the southern hemisphere) have the highest number of deadly events. Meanwhile the number of fatal events per year shows no clear pattern. Maybe I should see when the most deadly events occurred.
OK, so 2010 was the year with the largest share of the 1% most deadly mass movements. Why is this, because of landslide size maybe? I'll explore this further later on.
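For reference, picking out the 1% most deadly events can be done with a quantile cutoff. This is only a sketch on made-up numbers; the `events` data frame and its values are hypothetical, and the report's actual filtering may have been done differently.

```r
# Toy data: 200 events, most with few deaths (values are made up)
events <- data.frame(
  year           = c(rep(2009, 150), rep(2011, 47), 2010, 2010, 2013),
  fatality_count = c(rep(1, 150), rep(5, 47), 300, 2000, 5000)
)

# The 99th percentile of fatality_count marks the start of the top 1%
cutoff <- quantile(events$fatality_count, probs = 0.99)

# Events strictly above the cutoff are treated as the most deadly 1%
top_one_percent <- events[events$fatality_count > cutoff, ]

# Count how many of these events fall in each year
table(top_one_percent$year)
```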
Now what do some of the other categorical variables look like?
OK great, so medium sized landslides are the most frequent, but what does that even mean? I don't know how big a medium sized landslide is… I'll want to plot how many fatalities occur for the different landslide sizes to work out what this means. As for landslide triggers, downpour, rain and continuous rain seem to be the most frequent. They all involve moisture, and the distinction between these trigger categories is quite confusing as they are all pretty much the same. Furthermore, landslides above roads are the most frequent, and the most common mass movements are landslides and mudslides.
Where did the number of mass movements increase?
Hmm, that was to be expected. I'll only include countries with 100 or more landslides.
So the US has the most landslides, followed by India and the Philippines.
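Filtering to countries with at least 100 events can be sketched with `table()`. The country counts below are invented for illustration; only the threshold of 100 comes from the text.

```r
# Toy country_name column (counts are made up)
country_name <- c(rep("United States", 120), rep("India", 105), rep("Fiji", 3))

# Count events per country
counts <- table(country_name)

# Names of the countries with 100 or more events
frequent <- names(counts[counts >= 100])

# Keep only the events from those countries
kept <- country_name[country_name %in% frequent]
table(kept)
```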
I just got the idea of using the storm_name variable to see which storms caused the most landslides. Also I want to see only storms with more than 5 resultant landslides.
This gets me to the idea of maybe extracting the beginning of each storm name to count the events of a typhoon, hurricane, and tropical storm/cyclone. I think I have to use regular expressions for this…
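A regular-expression sketch of that idea in base R. The storm names and the list of prefixes in `pattern` are assumptions for illustration; names without a recognised prefix (for example just "Haiyan") simply end up as NA.

```r
# Toy storm_name values; real names may be formatted differently
storm_name <- c("Typhoon Haiyan", "Hurricane Matthew",
                "Tropical Storm Agatha", "Tropical Cyclone Komen",
                "Haiyan")

# Match a known storm-type prefix at the start of the name, ignoring case
pattern <- "^(typhoon|hurricane|tropical (storm|cyclone))"
m <- regexpr(pattern, storm_name, ignore.case = TRUE)

# regexpr() returns -1 where there is no match
hit <- m > 0

# regmatches() returns the matched text for the matching elements only
storm_type <- rep(NA_character_, length(storm_name))
storm_type[hit] <- tolower(regmatches(storm_name, m))

table(storm_type, useNA = "ifany")
```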
This graph uses all mass movement events
While this graph uses only events where fatalities occurred
Aha! So while tropical storms/cyclones create the most mass movement events, typhoons, which occur in the West Pacific, seem to create most of the fatal landslides.
Also, I can still explore the latitude variable…
Very interesting, so there are very few mass movements in the Southern Hemisphere compared to the Northern Hemisphere. I'm sure I can explore this more in the next sections and see how latitude might correlate with fatalities or something like that.
The data set has three important numerical variables: fatality_count, injury_count and admin_division_population. There are also numerous factors such as landslide_size, landslide_trigger and country_name. However none of these factors are ordered.
Through multiple univariate plots I observed that:
* The most frequent month for mass movements is July, while most fatal mass movements occur in August.
* Most mass movements occurred in 2010.
* The US has the most mass movement events.
* 75% of the records have a fatality count of 1 or less.
* The maximum fatality count is 5000, whilst the maximum injury count is 374.
* Medium sized mass movements are the most common.
* Most mass movements occur between 25 and 50 degrees north of the equator.
* Landslides are the most common mass movement.
* Among storm-related events, typhoons account for most of the deadly mass movements.

The fatality_count, landslide_size, year and month variables are the most interesting. I can use these to create further plots later on.
I can see that the latitude and injury_count variables might come into play later in some regression models.

### Did you create any new variables from existing variables in the dataset?
I created the year and month discrete variables, which were extracted from the event_date time stamp in the original data.
I also created the storm_type variable, which extracts the first word from the storm_name variable. Frustratingly, many of the storm_name values do not start with a word that corresponds to the type of storm. For example, the name of a storm might be “Haiyan” instead of “Typhoon Haiyan”.
The deadly mass movements per year plot looked quite unusual, with events peaking in 2010 after quite few deadly mass movements in earlier years. To investigate, I plotted only the 1% most deadly mass movements, which suggested that the peak might be due to a spike in very deadly events in 2010.
I also plotted injury and fatality count on a log10 scale, as the data was skewed to the right.
First, let's see how injury_count and fatality_count relate.
So now on to the correlation between injury and fatalities
##
## Pearson's product-moment correlation
##
## data: landslide_data$fatality_count and landslide_data$injury_count
## t = 14.69, df = 5349, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1710286 0.2225413
## sample estimates:
## cor
## 0.1969208
Hmm, it seems the correlation here is low, so I guess I was wrong to assume that the two are strongly correlated. However, above roughly 1000 deaths there are far too few values and widely spaced outliers, so what is the correlation when the fatality count is below 1000?
##
## Pearson's product-moment correlation
##
## data: fi.cor$fatality_count and fi.cor$injury_count
## t = 19.074, df = 977, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4737179 0.5651222
## sample estimates:
## cor
## 0.5209117
Now this looks better! This tells us that the injury and fatality counts are moderately correlated, at least when not influenced by outliers.
Can we use the newly created data frame fi.cor to build a statistical model predicting fatalities from injuries?
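To see how much a single extreme event can drag Pearson's r around, here is a small sketch on invented numbers. It uses `cor()` rather than `cor.test()` for brevity, and the 1000-death cutoff mirrors the one used above; none of the values come from the real data.

```r
# Toy counts: six ordinary events plus one extreme outlier (values are made up)
fatality_count <- c(1, 2, 4, 6, 9, 12, 5000)
injury_count   <- c(0, 1, 3, 5, 8, 10, 2)

# Correlation on everything, outlier included
r_all <- cor(fatality_count, injury_count, use = "complete.obs")

# Correlation after dropping events with 1000 or more deaths
keep <- fatality_count < 1000
r_trimmed <- cor(fatality_count[keep], injury_count[keep], use = "complete.obs")

c(all = r_all, trimmed = r_trimmed)
```

In this toy case the outlier alone is enough to hide an otherwise nearly linear relationship, which is the same effect the two correlation tests above hint at.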
##
## Call:
## lm(formula = fatality_count ~ injury_count, data = fi.cor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -160.460 -4.901 -3.901 -1.283 274.099
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.90073 0.71389 8.266 4.5e-16 ***
## injury_count 0.69137 0.03625 19.074 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.97 on 977 degrees of freedom
## (1460 observations deleted due to missingness)
## Multiple R-squared: 0.2713, Adjusted R-squared: 0.2706
## F-statistic: 363.8 on 1 and 977 DF, p-value: < 2.2e-16
Well, that says that each additional injury is associated with about 0.69 additional deaths, so roughly 17 additional deaths for 25 injuries, on top of an intercept of about 5.9 deaths.
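Plugging the coefficients from the model summary above into the fitted line gives a quick sanity check of that interpretation. The coefficients are copied from the output; the choice of 25 injuries is just the example from the text.

```r
# Coefficients copied from the lm() summary above
intercept <- 5.90073
slope     <- 0.69137

# Additional deaths associated with 25 injuries (slope only)
extra_deaths_25 <- slope * 25

# Full model prediction for an event with 25 injuries
predicted_25 <- intercept + slope * 25

round(c(extra = extra_deaths_25, predicted = predicted_25), 2)
```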
Now what about fatalities varying over time? I saw earlier how the count of landslides changed over the different years and months, but how do the fatalities vary over these periods?
The number of deaths peaks in 2013, which is interesting as most mass movements occurred in 2010. I'll want to explore this further… I was also wrong to assume that fatalities peak in August, as I did when I first analysed the count of mass movements for each month. Instead the fatalities peak in June and then again in August, giving the graph a bimodal form.
OK, from the univariate analysis I saw that of the 1% most deadly mass movements, 6 occurred in 2010. Let's see how many deaths each of these 1% most deadly mass movements claimed. This might also help explain why the fatalities peaked in 2013…
Wow, OK, this gives us some insight. 2010 might have been the year with the most fatal landslides, but it certainly did not result in the most deaths. 2013 has the highest number of fatalities because it hosts the single most fatal event, which led to the loss of about 5000 lives. 2014 also saw a major deadly event, which claimed the lives of around 2000 people.
We also explored administrative division population earlier on. Is there a correlation between fatality count and administrative division population? I want to exclude populations below 1000.
OK, so it looks like there is almost no correlation between these two variables.
What are the average and median fatalities per landslide type?
##
##                                 complex               creep
##                   1                  73                   0
##         debris_flow          earth_flow               lahar
##                  15                   1                   1
##           landslide            mudslide               other
##                1817                 408                  18
##  riverbank_collapse           rock_fall      snow_avalanche
##                   3                  88                   7
##              topple translational_slide             unknown
##                   0                   2                   8
What was I thinking, I should have used a box plot… (Here the mean of debris flow is extending past my maximum y limit)
Great stuff… so it looks like, of my chosen categories, snow avalanches are the most deadly by median fatalities, whilst, as seen in the previous plot, debris flows have the highest mean fatality count.
So now I’ll look back at the countries with more than 100 mass movement events and see which countries have the most fatalities.
So despite the fact that the US had the most mass movement events, India, China and the Philippines seem to have the most fatalities. In fact, most of the less economically developed countries in this graph show extremely large fatality counts compared to the more economically developed countries.
To see where deadly events occur better, I can split the latitude values into northern and southern hemisphere.
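Splitting latitude into hemispheres is a one-liner with `ifelse()`. The labels match the ones in the summaries below; the latitude values are made up, and treating the equator itself (latitude 0) as northern is an arbitrary choice.

```r
# Toy latitude values in decimal degrees (made up)
latitude <- c(30.53, -46.77, 13.92, 72.63, -5.2)

# Non-negative latitudes are north of (or on) the equator
hemisphere <- ifelse(latitude >= 0, "Northern Hemisphere", "Southern Hemisphere")

table(hemisphere)
```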
Summary of each hemisphere:
## fatality_subset$hemisphere: Northern Hemisphere
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 3.00 13.04 7.00 5000.00
## --------------------------------------------------------
## fatality_subset$hemisphere: Southern Hemisphere
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 4.00 11.01 8.00 424.00
Records of mass movements with fatalities:
##
## Northern Hemisphere Southern Hemisphere
## 2057 385
Hmm, we can also use both the latitude and longitude values to make a scatter plot of the mass movement events.
That isn’t very clear… I’ll want to create a proper map of this later on. But we can see that mass movements are mainly concentrated in East Asia and North America.
I now want to go on to analyse how the other categorical variables relate to fatalities.
So it looks like landslide and mudslide events lead to the most deaths, which is due to the large number of deadly events in these categories. Debris flows are also deadly, but this is the result of one very deadly event.
Just as we saw in univariate plots with the amount of mass movements, the effects of moisture (downpour, rain, etc.) lead to most deaths.
## fatality_subset$landslide_size:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 2.75 3.50 3.50 4.25 5.00
## --------------------------------------------------------
## fatality_subset$landslide_size: catastrophic
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 53.5 103.0 145.3 216.0 329.0
## --------------------------------------------------------
## fatality_subset$landslide_size: large
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 4.00 11.00 19.45 21.00 253.00
## --------------------------------------------------------
## fatality_subset$landslide_size: medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 3.00 5.12 5.00 280.00
## --------------------------------------------------------
## fatality_subset$landslide_size: small
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 2.438 3.000 16.000
## --------------------------------------------------------
## fatality_subset$landslide_size: unknown
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 4.621 4.000 40.000
## --------------------------------------------------------
## fatality_subset$landslide_size: very_large
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 5 35 221 91 5000
In the univariate analysis I questioned what the mass movement sizes actually mean. Looking at my summary and plot, the size categories appear to track the median number of deaths. The plot also shows that the very_large mass movements are the most deadly, followed by medium and large sized mass movements.
Now I want to quickly also analyse how the storm types vary in causing mass movement deaths
Well, this isn’t too surprising, hurricanes normally impact quite developed nations such as the US in the North-West Atlantic. Meanwhile cyclones and typhoons mostly impact less developed nations, explaining the fatality deviations quite well.
Injury and fatality counts are somewhat correlated, which makes sense. However, injuries explain only about 27% of the variance in fatalities. Surprisingly, administrative division population had almost no correlation with the number of deaths. There is also a surprising relationship between the size of a mass movement and the fatality count: the larger the mass movement, the higher the fatality count tends to be.
The relationship between injury and fatality count was obviously the strongest. However due to outliers and many NA values, the relationship has become quite complicated.
So earlier on, we were looking at the fatalities for the 1% most deadly mass movements. Let's continue exploring this. What were these events, what were they triggered by and where did they occur?
Interesting, so these top 1% deadly events are mainly debris flows and landslides. The main triggers for the most deadly events seem to be heavy downpours and continuous rain. Seismic activity and large scale atmospheric disturbances such as tropical cyclones and monsoon rainfall appear to have a lesser effect. The most deadly mass movements are also spread quite widely around the world. The 2013 event is, unsurprisingly, in India; I had guessed it would be in either China or India. Still, most deadly events occur in Asia, especially in the subtropics and tropics.
Now, how have fatalities altered over the years, this time plotted on a frequency polygon?
Well that doesn’t reveal anything… so let me see if something changes if I group the dates into more broad timescales!
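Grouping years into broader periods can be sketched with `cut()`. The period boundaries and labels below are assumptions chosen for illustration; the report does not say which bins were actually used.

```r
# Toy event years (made up)
year <- c(2007, 2009, 2010, 2012, 2013, 2016)

# cut() uses right-closed intervals by default: (2006,2010], (2010,2014], (2014,2018]
period <- cut(year,
              breaks = c(2006, 2010, 2014, 2018),
              labels = c("2007-2010", "2011-2014", "2015-2018"))

# Number of events per period
table(period)
```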
Oh wow! That already says a lot! So have mass movements increased in frequency over recent years, have population densities in vulnerable areas increased, or have simply more fatalities been reported?
In the very beginning I saw that deadly mass movements increase in frequency during June, July and August. I assumed it was due to the Monsoon season in the northern hemisphere, but was this assumption correct?
Yes, it seems like that assumption was correct. The northern hemisphere does experience more deadly mass movements in its summer months.
I want to go more in depth with the two most occurring mass movement categories. Over the years, how have landslides and mudslides changed? Has their occurrence increased, have the deaths increased? In what months do these events occur and in which countries?
Smoothing the plot reveals that landslides have become almost half as deadly since 2007. Meanwhile mudslide deaths peaked in 2007 and then fell considerably from 2015 to the present day.
So we now know that fatalities have changed, but how have the number of landslides changed?
Now this is very interesting! The data shows that while the number of these two mass movements has increased, the deaths from them have decreased since 2007, as shown in the previous plots.
In the bivariate plots section I identified that medium sized mass movements are the second most deadly after very large mass movements. Is this also true for mass movements which claim fewer lives?
So it looks like medium sized landslides are the major killer among the less deadly mass movements.
For countries having more than 100 mass movement events, what are their most frequent and most deadly mass movements?
Nice, so landslides tend to occur as the major category in all of these countries, the US also seems to have a sizable amount of mudslides. Now how do the fatalities from mass movements compare?
Looks like the most deadly mass movement category in India was a single debris flow event. Overall, though, landslides seem to be the biggest killer in most of these countries.
Now I want to see the mass movement in all countries, to do this I need to create maps. But before I start creating proper maps, I want to create maps showing the precise locations of mass movement using the latitude and longitude variables.
OK, so now I might be getting a bit ambitious but what about plotting all this on a proper map? First I need to look how I might do that…
OK, so I got it! Thanks to a blog post by Brennon Borbon!
WOW, finally! Notice how the countries with the most fatalities are in Eastern and Southern Asia. Latin America also seems to have a fair share. Now that I have this, what about changing the code to reflect the number of mass movement events and the size of these events?
Now we can clearly see that whilst the number of mass movements is lower in China, Indonesia and India, the fatality rate is much higher. What about the mode of mass movement triggers? Here I use a function to calculate the mode. This is courtesy of Ken Williams, from his answer to a question on Stack Overflow.
OK, interesting to see that downpour is the most common trigger of mass movements nearly everywhere. However, in Europe, Afghanistan, Oman, Russia and some other countries, rain is the most common trigger. In areas affected by tropical cyclones, such as Mexico, Cuba, Madagascar and Taiwan, these are often the most common trigger.
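Base R has no built-in function for the statistical mode, so a small helper is needed. The version below is a commonly used form of such a helper; whether it matches the code used for the map exactly is an assumption, and the trigger values are made up.

```r
# Return the most frequent value of x (first one wins on ties)
Mode <- function(x) {
  ux <- unique(x)
  # match() maps each value to its index in ux; tabulate() counts the indices
  ux[which.max(tabulate(match(x, ux)))]
}

# Toy landslide_trigger values for one country (made up)
triggers <- c("rain", "downpour", "downpour", "earthquake", "downpour")
Mode(triggers)
```

Applied per country (for example with `tapply()`), a helper like this gives the most frequent trigger for each one.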
The last plots I want to make are heat maps.
Cool! It makes sense that very large landslides occur from May to August, as most deaths are in India and China, where the monsoon season runs from July through to September and heavy rains already start in May. Maybe splitting this into seasons will give us a better view?
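One way to turn month numbers into seasons is a simple lookup vector. This sketch assumes northern hemisphere meteorological seasons (December to February as winter, and so on); the report does not state which definition was used, and the month values are made up.

```r
# Season for each month number 1..12, northern hemisphere meteorological
# seasons (an assumption)
season_lookup <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
                   "Summer", "Summer", "Autumn", "Autumn", "Autumn", "Winter")

# Toy month column (made up)
month <- c(1, 4, 7, 8, 10, 12)

# Index the lookup vector by month number
season <- season_lookup[month]

table(season)
```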
So what variables can we use in our model?
##
## Calls:
## m1: lm(formula = fatality_count ~ injury_count, data = subset(fatality_subset,
## fatality_count < 1000))
## m2: lm(formula = fatality_count ~ injury_count + time_of_year, data = subset(fatality_subset,
## fatality_count < 1000))
## m3: lm(formula = fatality_count ~ injury_count + time_of_year + landslide_size,
## data = subset(fatality_subset, fatality_count < 1000))
##
## ===================================================================
## m1 m2 m3
## -------------------------------------------------------------------
## (Intercept) 5.901*** 4108.560*** 4295.221***
## (0.714) (866.554) (859.275)
## injury_count 0.691*** 0.679*** 0.557***
## (0.036) (0.036) (0.037)
## time_of_year -2.035*** -2.117***
## (0.430) (0.426)
## landslide_size: .L 62.458***
## (8.490)
## landslide_size: .Q 27.098***
## (7.097)
## landslide_size: .C 6.035
## (5.568)
## landslide_size: ^4 3.257
## (3.562)
## -------------------------------------------------------------------
## R-squared 0.271 0.288 0.357
## adj. R-squared 0.271 0.286 0.353
## sigma 21.965 21.729 20.984
## F 363.834 197.112 87.073
## p 0.000 0.000 0.000
## Log-likelihood -4412.732 -4401.617 -4231.571
## Deviance 471384.152 460801.268 414774.111
## AIC 8831.464 8811.234 8479.142
## BIC 8846.123 8830.780 8517.985
## N 979 979 949
## ===================================================================
The most interesting relationship was between my time_of_year variable and fatality_count. This clearly showed that fatalities are decreasing among mudslides and landslides, which are by far the largest mass movement categories. Meanwhile the number of these mass movement events has increased over the same period of time.
Yes, I created a linear model of fatality count (for events with fewer than 1000 deaths) against injury count.
The final model explained only about 36% of the variance in the number of fatalities. Based on my plots I added the time_of_year and landslide_size variables, which both improved the R^2 value. The landslide_size variable had the greatest effect, raising R^2 from 0.288 to 0.357.
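The R-squared gains can be read straight off the model comparison table. The values below are copied from that table; note that m3 was fitted on 949 rather than 979 observations, so the comparison with m1 and m2 is not strictly like for like.

```r
# R-squared values copied from the model comparison table
r2 <- c(m1 = 0.271, m2 = 0.288, m3 = 0.357)

# Gain from adding time_of_year (m1 -> m2)
gain_time <- r2[["m2"]] - r2[["m1"]]

# Gain from adding landslide_size (m2 -> m3)
gain_size <- r2[["m3"]] - r2[["m2"]]

round(c(time_of_year = gain_time, landslide_size = gain_size), 3)
```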
The deaths claimed by landslides seem to have decreased by about 50% from 2007 to 2017. The deaths from mudslides decreased by more than 50% over the same period. This is really interesting, as it shows that the two major causes of death among mass movements are slowly becoming less deadly.
While the previous plot showed that deaths claimed by landslides and mudslides seem to be decreasing, this plot clearly shows that the number of these mass movements is increasing. So while there is an increase in events, there is a decrease in fatalities at the same time.
Summer is the deadliest season for mass movements. In the plot we can clearly see that very large landslides in summer are the largest killer. Surprisingly, winter and autumn seem to be the least deadly seasons.
This was a data set that I found myself, so naturally I had a bit of cleaning up to do first. I removed columns that I knew would not be helpful, such as edited_date or photo_link. I started my exploration by looking at the fatality and injury count variables, which stood out as soon as I first looked at the NASA Global Landslide Catalog (GLC) data. This showed me that extremely deadly events were not common, which was an important stepping stone in my analysis.
From my new, clean landslide_data data frame I also created another data frame called fatality_subset. I used this data frame most of the time, as it excluded all fatality_count values which were zero or NA. Next I found that extracting information from existing columns could be very helpful. I first did this when I extracted the year and month from the original event_date variable. This helped me immensely, and many of my further explorations were based on these variables.
The landslide_size variable also became more and more important. It gave important insights, and in my bivariate plots section I found that the size categories tracked the median number of fatalities, with the largest mass movements having the largest median fatalities. I also wanted to explore where these mass movements took place, so I started using the latitude variable, which suggested that a majority of events and deaths occurred in the northern hemisphere. Using maps in my multivariate analysis, I found which countries were affected the most.
The most difficult part was deciding which insights were really important and which variables could help in predicting further landslides. Another problem was the lack of numerical variables in the data, which made me rely heavily on the fatality_count variable. I also struggled with creating the maps using the map_data() function, until I finally got the idea of using the dplyr package to group and summarize the data. I would count creating the maps and the insights gained from the graphs in the final plots section as the greatest successes from analyzing the data.
In future I would want to use regex or the stringr package to extract the information in the event description variable. Analyzing the sources of information might also help in gaining better insights.